9: Summary statistics

In analyzing the New York City Motor Vehicle Collision dataset, several summary statistics can provide insights into the nature and impact of collisions across the city. Here are the summary statistics we are planning to explore for this dataset:

  1. Average Number of Collisions Per Day: This statistic helps understand the daily frequency of collisions, providing a baseline for identifying days with unusually high or low numbers of incidents. It’s a key indicator of overall traffic safety.

  2. Median Number of Persons Injured in Collisions: The median gives a better sense of the typical collision severity by showing the middle value of injuries in all reported collisions. It’s less influenced by extreme values than the mean, making it a reliable measure of typical outcomes.

  3. Percentiles for Number of Fatalities in Collisions: Percentiles (such as the 90th, 95th, and 99th) for fatalities can help identify the severity distribution of the most lethal collisions. Understanding the tail of this distribution is crucial for targeted interventions on the most dangerous incidents.

  4. Average Number of Pedestrians, Cyclists, and Motorists Involved in Collisions: Breaking down the average number of pedestrians, cyclists, and motorists involved in collisions can highlight which road users are most at risk. This can inform targeted safety campaigns or infrastructure improvements.


import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# Load the dataset
data_path = 'output/datasets/dataset_cleaned.csv'
data = pd.read_csv(data_path)


# Display basic summary statistics for numerical columns
summary_stats = data.describe()

# Displaying the results
print(summary_stats)
           latitude     longitude  number_of_persons_injured  \
count  1.760105e+06  1.760105e+06               1.760089e+06   
mean   4.072425e+01 -7.391972e+01               3.136728e-01   
std    7.916310e-02  8.586757e-02               6.987703e-01   
min    4.049895e+01 -7.425496e+01               0.000000e+00   
25%    4.066829e+01 -7.397446e+01               0.000000e+00   
50%    4.072096e+01 -7.392689e+01               0.000000e+00   
75%    4.077010e+01 -7.386693e+01               0.000000e+00   
max    4.091288e+01 -7.370055e+01               4.300000e+01   

       number_of_persons_killed  number_of_pedestrians_injured  \
count              1.760077e+06                   1.760105e+06   
mean               1.474367e-03                   5.923056e-02   
std                4.044532e-02                   2.495776e-01   
min                0.000000e+00                   0.000000e+00   
25%                0.000000e+00                   0.000000e+00   
50%                0.000000e+00                   0.000000e+00   
75%                0.000000e+00                   0.000000e+00   
max                8.000000e+00                   2.700000e+01   

       number_of_pedestrians_killed  number_of_cyclist_injured  \
count                  1.760105e+06               1.760105e+06   
mean                   7.397286e-04               2.834831e-02   
std                    2.770540e-02               1.679093e-01   
min                    0.000000e+00               0.000000e+00   
25%                    0.000000e+00               0.000000e+00   
50%                    0.000000e+00               0.000000e+00   
75%                    0.000000e+00               0.000000e+00   
max                    6.000000e+00               4.000000e+00   

       number_of_cyclist_killed  number_of_motorist_injured  \
count              1.760105e+06                1.760105e+06   
mean               1.170385e-04                2.219288e-01   
std                1.087019e-02                6.577966e-01   
min                0.000000e+00                0.000000e+00   
25%                0.000000e+00                0.000000e+00   
50%                0.000000e+00                0.000000e+00   
75%                0.000000e+00                0.000000e+00   
max                2.000000e+00                4.300000e+01   

       number_of_motorist_killed  collision_id  
count               1.760105e+06  1.760105e+06  
mean                5.914420e-04  3.335375e+06  
std                 2.648232e-02  1.389584e+06  
min                 0.000000e+00  1.579000e+03  
25%                 0.000000e+00  3.260377e+06  
50%                 0.000000e+00  3.764076e+06  
75%                 0.000000e+00  4.239331e+06  
max                 4.000000e+00  4.721095e+06  
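
The table above mixes coordinates and collision IDs with the casualty counts and prints everything in scientific notation, which makes the counts hard to read. A minimal sketch of a more readable summary follows, assuming the `number_of_` column prefix seen above; the small `sample` frame is made-up stand-in data, not the collision dataset.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned collision data (values are illustrative only)
sample = pd.DataFrame({
    "latitude": [40.70, 40.72, 40.75, 40.68],
    "collision_id": [101, 102, 103, 104],
    "number_of_persons_injured": [0, 1, 0, 3],
    "number_of_persons_killed": [0, 0, 0, 1],
})

# Keep only the casualty-count columns; coordinates and IDs are not meaningful to average
count_cols = [c for c in sample.columns if c.startswith("number_of_")]

# Transpose so each statistic is a column, and round for fixed-point display
readable_stats = sample[count_cols].describe().T.round(4)
print(readable_stats)
```

On the real data, the same selection could be applied to `data` in place of `sample`.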
mean_injuries = data['number_of_persons_injured'].mean()
median_injuries = data['number_of_persons_injured'].median()
print("Mean number of persons injured:", mean_injuries)
print("Median number of persons injured:", median_injuries)
Mean number of persons injured: 0.31367277450174397
Median number of persons injured: 0.0
import plotly.express as px

# Plotting the number of persons injured in each incident
fig = px.histogram(data, x='number_of_persons_injured', title='Distribution of Persons Injured per Incident')
fig.show()
# Plotting incidents over time, assuming 'date/time' is properly formatted and cleaned
fig_time = px.histogram(data, x='date/time', title='Distribution of Incidents Over Time')
fig_time.show()

Average Number of Collisions Per Day: This statistic helps understand the daily frequency of collisions, providing a baseline for identifying days with unusually high or low numbers of incidents. It’s a key indicator of overall traffic safety.

data['date/time'] = pd.to_datetime(data['date/time'], errors='coerce')
print(data['date/time'].dtype)
# 'date/time' is now a datetime column; rows that failed to parse are NaT
# and are excluded by groupby, which drops missing keys by default.
# Group by the calendar-date component of 'date/time'
daily_collisions = data.groupby(data['date/time'].dt.date).size()

# Calculate the average number of collisions per day
average_collisions_per_day = daily_collisions.mean()
print("Average Number of Collisions Per Day:", average_collisions_per_day)
datetime64[ns]
Average Number of Collisions Per Day: 425.55730174081236
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(daily_collisions.index, daily_collisions, marker='o', linestyle='-')
plt.title('Daily Collisions Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Collisions')
plt.grid(True)
plt.xticks(rotation=45)  # Rotate date labels for better readability
plt.tight_layout()
plt.show()
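
One caveat with grouping by date is that days with no recorded collisions do not appear in the result at all, so the average is taken over active days only. The sketch below shows one possible way to include empty days by resampling to a daily frequency. The timestamps are hypothetical and only illustrate the difference between the two averages.

```python
import pandas as pd

# Hypothetical timestamps standing in for data['date/time'] (illustrative only)
timestamps = pd.Series(pd.to_datetime([
    "2021-01-01 08:00", "2021-01-01 17:30",
    "2021-01-03 09:15",   # note: nothing recorded on 2021-01-02
    None,                 # an unparseable row becomes NaT
]))

# Drop NaT before counting
valid = timestamps.dropna()

# One row per collision, indexed by time; daily resampling yields 0 for empty days
daily_counts = pd.Series(1, index=pd.DatetimeIndex(valid)).resample("D").sum()

mean_active_days = valid.groupby(valid.dt.date).size().mean()  # empty days skipped
mean_all_days = daily_counts.mean()                            # empty days included

print("Mean over days with at least one collision:", mean_active_days)
print("Mean over every calendar day in range:", mean_all_days)
```

For a dataset as dense as this one the two figures are likely to be close, but the resampled series is also the safer input for a time-series plot because gaps show up as zeros rather than being silently skipped.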

Median Number of Persons Injured in Collisions: The median gives a better sense of the typical collision severity by showing the middle value of injuries in all reported collisions. It’s less influenced by extreme values than the mean, making it a reliable measure of typical outcomes.

median_injuries = data['number_of_persons_injured'].median()
print("Median Number of Persons Injured in Collisions:", median_injuries)
Median Number of Persons Injured in Collisions: 0.0
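
A median of 0 only says that more than half of the collisions involve no injuries; it says little about severity when an injury does occur. A minimal sketch of two complementary figures, the share of collisions with at least one injury and the mean among those collisions, is shown below on made-up values.

```python
import pandas as pd

# Hypothetical injury counts standing in for data['number_of_persons_injured']
injured = pd.Series([0, 0, 0, 1, 0, 2, 0, 0, 1, 0], name="number_of_persons_injured")

# Fraction of collisions in which at least one person was injured
share_with_injury = (injured > 0).mean()

# Average injuries, restricted to collisions that had any injury
mean_given_injury = injured[injured > 0].mean()

# Full frequency table: how many collisions had 0, 1, 2, ... injuries
injury_counts = injured.value_counts().sort_index()

print("Share of collisions with an injury:", share_with_injury)
print("Mean injuries among injury collisions:", mean_given_injury)
print(injury_counts)
```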

Percentiles for Number of Fatalities in Collisions: Percentiles (such as the 90th, 95th, and 99th) for fatalities can help identify the severity distribution of the most lethal collisions. Understanding the tail of this distribution is crucial for targeted interventions on the most dangerous incidents.

fatalities_percentiles = data['number_of_persons_killed'].quantile([0.90, 0.95, 0.99])
print("Fatalities at 90th, 95th, and 99th Percentiles:", fatalities_percentiles)
Fatalities at 90th, 95th, and 99th Percentiles: 0.90    0.0
0.95    0.0
0.99    0.0
Name: number_of_persons_killed, dtype: float64
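
All three percentiles are 0 because fatal collisions are rare: the mean of `number_of_persons_killed` in the summary table is about 0.0015, so far fewer than 1% of collisions involve a death. When the event is this rare, a frequency table and the share of fatal collisions may describe the tail more directly than fixed percentiles. The sketch below illustrates this on made-up values, not on the real data.

```python
import pandas as pd

# Hypothetical fatality counts: 997 non-fatal collisions and 3 fatal ones
killed = pd.Series([0] * 997 + [1, 1, 2], name="number_of_persons_killed")

# Share of collisions with at least one fatality
fatal_share = (killed > 0).mean()

# How many collisions had 0, 1, 2, ... fatalities
fatal_counts = killed.value_counts().sort_index()

# Worst single collision in the sample
max_killed = killed.max()

print("Share of fatal collisions:", fatal_share)
print(fatal_counts)
print("Maximum fatalities in one collision:", max_killed)
```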

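Average Number of Pedestrians, Cyclists, and Motorists Involved in Collisions: Breaking down the average number of pedestrians, cyclists, and motorists involved in collisions can highlight which road users are most at risk. This can inform targeted safety campaigns or infrastructure improvements. The numerical columns summarized above record injured and killed counts per road-user type rather than counts of people involved, so the code below averages the injured columns.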

avg_pedestrians = data['number_of_pedestrians_injured'].mean()
avg_cyclists = data['number_of_cyclist_injured'].mean()
avg_motorists = data['number_of_motorist_injured'].mean()

print("Average Number of Pedestrians Injured per Collision:", avg_pedestrians)
print("Average Number of Cyclists Injured per Collision:", avg_cyclists)
print("Average Number of Motorists Injured per Collision:", avg_motorists)
Average Number of Pedestrians Injured per Collision: 0.05923055726789027
Average Number of Cyclists Injured per Collision: 0.02834830876567023
Average Number of Motorists Injured per Collision: 0.22192880538376972
avg_data = {
    'Category': ['Pedestrians', 'Cyclists', 'Motorists'],
    'Average Injured': [avg_pedestrians, avg_cyclists, avg_motorists]
}
avg_df = pd.DataFrame(avg_data)

# The averages come from the *_injured columns, so label the chart accordingly
fig_bar = px.bar(avg_df, x='Category', y='Average Injured', title='Average Number of Pedestrians, Cyclists, and Motorists Injured per Collision')
fig_bar.show()
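
Per-collision averages are small numbers that are hard to compare at a glance. One alternative view is each group's share of all injuries across the three categories, sketched below on made-up values and assuming the three `*_injured` column names used above. Note that neither view measures risk per trip or per mile, since the dataset carries no exposure information.

```python
import pandas as pd

# Hypothetical stand-in rows (illustrative only)
sample = pd.DataFrame({
    "number_of_pedestrians_injured": [1, 0, 0, 0],
    "number_of_cyclist_injured":     [0, 1, 0, 0],
    "number_of_motorist_injured":    [0, 0, 2, 0],
})

# Total injuries per road-user category
totals = sample.sum()

# Each category's share of all injuries across the three categories
shares = totals / totals.sum()

print(shares)
```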